Maximal Common Subsequences and Minimal Common Supersequences
Proposed Running Head: Common Subsequences and Supersequences
Authors
Abstract
The problems of finding a longest common subsequence and a shortest common supersequence of a set of strings are well known. They can be solved in polynomial time for two strings (in fact the problems are dual in this case), or for any fixed number of strings, by dynamic programming. But both problems are NP-hard in general for an arbitrary number $k$ of strings. Here we study the related problems of finding a shortest maximal common subsequence and a longest minimal common supersequence. We describe dynamic programming algorithms for the case of two strings (for which case the problems are no longer dual), which can be extended to any fixed number of strings. We also show that both problems are NP-hard in general for $k$ strings, although the latter problem, unlike shortest common supersequence, is solvable in polynomial time for strings of length 2. Finally, we prove a strong negative approximability result for the shortest maximal common subsequence problem.

Key words: string algorithms, subsequence, supersequence, dynamic programming, NP-hard optimisation problems, approximation algorithms.

1 Introduction

A subsequence of a string $\alpha$ is any string that can be obtained from $\alpha$ by the deletion of zero or more symbols. A supersequence of $\alpha$ is any string that can be obtained from $\alpha$ by the insertion of zero or more symbols. Given a set $S$ of $k$ strings, a common subsequence of $S$ is a string that is a subsequence of every string in $S$, and a common supersequence of $S$ is a string that is a supersequence of every string in $S$. The longest common subsequence (LCS) and shortest common supersequence (SCS) problems are classical problems of stringology, with important applications in computational biology, file comparison, data compression, etc.

A common subsequence $\gamma$ is maximal if no proper supersequence of $\gamma$ is also a common subsequence of $S$; in other words, if $\gamma$ is not contained as a subsequence in any longer common subsequence of $S$.
A shortest maximal common subsequence (smcs) of $S$ is a maximal common subsequence of shortest possible length. Clearly, a maximal common subsequence of greatest possible length is just a longest common subsequence, a concept that has been widely explored in the literature.

A common supersequence $\gamma$ is minimal if no proper subsequence of $\gamma$ is also a common supersequence of $S$; in other words, if $\gamma$ does not contain as a subsequence any shorter common supersequence of $S$. A longest minimal common supersequence (lmcs) of $S$ is a minimal common supersequence of longest possible length. Clearly, a minimal common supersequence of shortest possible length is just a shortest common supersequence.

Example. If $\alpha_1 = abc$, $\alpha_2 = bca$, the maximal common subsequences are $a$ and $bc$, and the unique smcs is $a$, of length 1. The minimal common supersequences are $bcabc$, $abca$ and $bacbac$, and the unique lmcs is $bacbac$, of length 6.

In this paper, we study the shortest maximal common subsequence problem (SMCS) and the longest minimal common supersequence problem (LMCS) from the complexity point of view. The problems are of genuine interest in their own right, although the original motivation for the study of maximal common subsequences and minimal common supersequences was in the context of approximation algorithms for LCS and SCS. For instance, any approximation algorithm for the LCS of a set of strings will return a common subsequence of the strings, which can then be made maximal by the insertion of zero or more additional characters. The question arises as to the extent to which the length of such a maximal common subsequence might differ from the length of a longest common subsequence. A similar situation holds in the case of approximation algorithms for the SCS of a set of strings.

We show that, like the LCS and SCS problems, both of these new problems can be solved in polynomial time by dynamic programming for $k = 2$ (and, by extending the algorithms, for any fixed value of $k$).
However, the dynamic programming algorithms are not quite so straightforward as those for the LCS and SCS problems, and have complexities $O(m^2 n)$ and $O(mn(m+n))$ respectively for strings of lengths $m$ and $n$.

Note that the existence of polynomial-time algorithms for the SMCS and LMCS problems in the case of two strings is by no means obvious. Consider the problem of finding a maximum cardinality matching in a bipartite graph. This problem is well known to be solvable in polynomial time, whereas the problem of finding a minimum maximal bipartite matching is NP-hard [YG80].

We also show that, as is the case for the LCS and SCS problems, SMCS and LMCS are NP-hard when the number of strings $k$ becomes a problem parameter. However, we pinpoint one interesting difference between the SCS and LMCS problems, namely that the latter, unlike the former, is solvable in polynomial time for strings of length 2. Furthermore, we prove a strong negative result regarding the likely existence of good polynomial-time approximation algorithms for SMCS in the case of general $k$. We leave open the possibility of a good polynomial-time approximation algorithm for LMCS.

2 The SMCS problem for two strings

When restricted to the case of just two strings $\alpha$ and $\beta$ of lengths $m$ and $n$ respectively, the classical LCS and SCS problems are easily solvable in $O(mn)$ time by dynamic programming. Indeed, in this case, the LCS and SCS problems are dual, in that $s = m + n - l$, where $l$ and $s$ are the lengths of a longest common subsequence and shortest common supersequence respectively. Much effort has gone into finding refinements of dynamic programming and other approaches which lead to improvements in complexity in many cases. See, for example, [AG87, Hir75, Hir77, HS77, Ukk85, WMMM90].

The following simple example serves to illustrate the fact that there is no obvious corresponding duality between the SMCS and LMCS problems in the case of two strings.

Example. Let $\alpha = abc$, $\beta = dab$.
Then the only maximal common subsequence is $ab$, of length 2, while $dabc$, $abcdab$ and $abdcab$ are some of the minimal common supersequences, the latter two being lmcs's.

It is true, however, that if $\gamma$ is a maximal common subsequence of length $r$ of $\alpha$ and $\beta$, then forming an alignment of $\alpha$ and $\beta$ in which the elements of $\gamma$ are matched reveals a minimal common supersequence, $\delta$ say, of $\alpha$ and $\beta$ of length $m + n - r$. For if $\delta$ were not minimal, then some single symbol $c$ could be removed from $\delta$ to give a shorter common supersequence $\delta'$. But the symbol of $\alpha$ or $\beta$ (or both) represented by $c$ in the alignment must be matched with new symbols in the alignment corresponding to $\delta'$, leading to a contradiction of the maximality of $\gamma$. Hence, if $l'$ is the length of an smcs, and $s'$ the length of an lmcs, it follows that $s' \ge m + n - l'$. That the inequality can be strict is shown by the above example, where $s' = 6$ and $m + n - l' = 4$.

Hence the question arises as to whether either or both of the SMCS and LMCS problems can be solved in polynomial time, by dynamic programming or otherwise. In this section we describe a polynomial-time algorithm to determine the length of an smcs of two strings, and in the following section a polynomial-time algorithm to determine the length of an lmcs of two strings. It turns out that these algorithms will determine the lengths of all maximal common subsequences and all minimal common supersequences respectively. They will also allow the construction of an smcs, and indeed of all the maximal common subsequences (respectively an lmcs, and all minimal common supersequences) of the two strings.

The algorithms use a dynamic programming approach based, as usual, on a table that relates the $i$th prefix $\alpha^i = \alpha[1 \ldots i]$ of $\alpha$ and the $j$th prefix $\beta^j = \beta[1 \ldots j]$ of $\beta$, for $i = 1, \ldots, m$, $j = 1, \ldots, n$, where $m, n$ are the lengths of $\alpha, \beta$ respectively. However, as we shall see, for each $i, j$ we need to retain rather more information than merely the lengths of the maximal common subsequences, or of the minimal common supersequences, of $\alpha^i$ and $\beta^j$.
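The definitions and the two small examples given so far can be checked by exhaustive search. The following is a minimal sketch working directly from the definitions, not one of the algorithms of this paper; it takes exponential time and is meant only for very short strings. The function names are illustrative.

```python
from itertools import combinations


def is_subsequence(s, t):
    """True if s can be obtained from t by deleting zero or more symbols."""
    it = iter(t)
    return all(ch in it for ch in s)


def common_subsequences(strings):
    """All common subsequences, enumerated from the first string (exponential)."""
    first = strings[0]
    candidates = {
        "".join(first[k] for k in idx)
        for size in range(len(first) + 1)
        for idx in combinations(range(len(first)), size)
    }
    return {c for c in candidates if all(is_subsequence(c, s) for s in strings)}


def maximal_common_subsequences(strings):
    """Common subsequences that are not proper subsequences of another one."""
    cs = common_subsequences(strings)
    return {c for c in cs if not any(c != d and is_subsequence(c, d) for d in cs)}


def is_minimal_common_supersequence(w, strings):
    """w covers every string, and deleting any single symbol breaks that."""
    def covers(u):
        return all(is_subsequence(s, u) for s in strings)
    # Single deletions suffice: any string lying between a common
    # supersequence and w is itself a common supersequence.
    return covers(w) and not any(covers(w[:k] + w[k + 1:]) for k in range(len(w)))


print(sorted(maximal_common_subsequences(["abc", "bca"])))        # ['a', 'bc']
print(sorted(maximal_common_subsequences(["abc", "dab"])))        # ['ab']
print(is_minimal_common_supersequence("abcdab", ["abc", "dab"]))  # True
```

For $\alpha = abc$, $\beta = dab$ this confirms that $ab$ is the only maximal common subsequence while $abcdab$, of length 6, is a minimal common supersequence, so that $s' > m + n - l'$.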
The SMCS Algorithm

Given a string $\alpha$ and a subsequence $\gamma$ of $\alpha$, we define

$$\mathrm{sp}(\alpha, \gamma) = \text{length of the shortest prefix of } \alpha \text{ that is a supersequence of } \gamma.$$

Given strings $\alpha, \beta$ of lengths $m$ and $n$ respectively, we define the set $S_{ij}$, for each $i = 1, \ldots, m$, $j = 1, \ldots, n$, by

$$S_{ij} = \{(r, (x, y)) : \alpha^i \text{ and } \beta^j \text{ have a maximal common subsequence } \gamma \text{ of length } r, \text{ and } \mathrm{sp}(\alpha, \gamma) = x,\ \mathrm{sp}(\beta, \gamma) = y\}$$

with $S_{00} = \{(0, (0, 0))\}$.

For a string $\alpha$ of length $m$, position $i$ and symbol $a$, we define

$$\mathrm{next}_\alpha(i, a) = \begin{cases} \min\{k : \alpha[k] = a,\ k > i\} & \text{if such a } k \text{ exists} \\ m + 1 & \text{otherwise.} \end{cases}$$

If $\gamma$ is a string and $a$ a symbol of the alphabet, we denote by $\gamma + a$ the string obtained by appending $a$ to $\gamma$. Likewise, if the last character of $\gamma$ is $a$, we denote by $\gamma - a$ the string obtained by deleting the final $a$ from $\gamma$.

The algorithm for SMCS is based on a dynamic programming scheme for the sets $S_{ij}$ defined above. So evaluation of $S_{mn}$ reveals the length of an smcs, and also finds the lengths of all maximal common subsequences of $\alpha$ and $\beta$ (indeed of all maximal common subsequences of all pairs of prefixes of $\alpha$ and $\beta$). Furthermore, suitable tracebacks in the array of $S_{ij}$ values can be used to generate not only an smcs, but all maximal common subsequences. The basis of the dynamic programming scheme is contained in the following theorem:

Theorem 1
(i) If $\alpha[i] = \beta[j] = a$ then
$$S_{ij} = \{(r, (\mathrm{next}_\alpha(x, a), \mathrm{next}_\beta(y, a))) : (r - 1, (x, y)) \in S_{i-1,j-1}\}.$$
(ii) If $\alpha[i] \ne \beta[j]$ then
$$S_{ij} = \{(r, (x, j)) \in S_{i-1,j}\} \cup \{(r, (i, y)) \in S_{i,j-1}\} \cup (S_{i-1,j} \cap S_{i,j-1}).$$

Proof (i) Suppose $(r - 1, (x, y)) \in S_{i-1,j-1}$ and that $\gamma'$ is a maximal common subsequence of $\alpha^{i-1}$ and $\beta^{j-1}$ of length $r - 1$ with $\mathrm{sp}(\alpha, \gamma') = x$ and $\mathrm{sp}(\beta, \gamma') = y$. Then it is immediate that $\gamma = \gamma' + a$ is a maximal common subsequence of $\alpha^i$ and $\beta^j$, and that $\mathrm{sp}(\alpha, \gamma) = \mathrm{next}_\alpha(x, a)$, $\mathrm{sp}(\beta, \gamma) = \mathrm{next}_\beta(y, a)$.

On the other hand, suppose that $\gamma$ is a maximal common subsequence of length $r$ of $\alpha^i$ and $\beta^j$. Then the last symbol of $\gamma$ is $a$, and $\gamma' = \gamma - a$ is certainly a common subsequence of $\alpha^{i-1}$ and $\beta^{j-1}$.
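If it were not maximal, then some proper supersequence $\delta$ of $\gamma'$ would be a common subsequence of $\alpha^{i-1}$ and $\beta^{j-1}$, and therefore $\delta + a$, a proper supersequence of $\gamma$, would be a common subsequence of $\alpha^i$ and $\beta^j$, contradicting the maximality of $\gamma$. So $(r - 1, (\mathrm{sp}(\alpha, \gamma'), \mathrm{sp}(\beta, \gamma'))) \in S_{i-1,j-1}$ and $(r, (\mathrm{sp}(\alpha, \gamma), \mathrm{sp}(\beta, \gamma))) \in S_{ij}$ with $\mathrm{sp}(\alpha, \gamma) = \mathrm{next}_\alpha(\mathrm{sp}(\alpha, \gamma'), a)$ and $\mathrm{sp}(\beta, \gamma) = \mathrm{next}_\beta(\mathrm{sp}(\beta, \gamma'), a)$.

(ii) Suppose $(r, (x, j)) \in S_{i-1,j}$, and that $\gamma$ is a maximal common subsequence of $\alpha^{i-1}$ and $\beta^j$ of length $r$ with $\mathrm{sp}(\alpha, \gamma) = x$, $\mathrm{sp}(\beta, \gamma) = j$. Then $\gamma$ is a common subsequence of $\alpha^i$ and $\beta^j$, and must be maximal since $\gamma + \alpha[i]$ cannot be a subsequence of $\beta^j$. A similar argument holds for $(r, (i, y)) \in S_{i,j-1}$. So

$$\{(r, (x, j)) \in S_{i-1,j}\} \cup \{(r, (i, y)) \in S_{i,j-1}\} \subseteq S_{ij}.$$

Further, if $(r, (x, y)) \in S_{i-1,j} \cap S_{i,j-1}$, then there is a string $\gamma$ with $\mathrm{sp}(\alpha, \gamma) = x < i$, $\mathrm{sp}(\beta, \gamma) = y < j$, of length $r$, which is a maximal common subsequence of $\alpha^i$ and $\beta^{j-1}$, and of $\alpha^{i-1}$ and $\beta^j$. So $\gamma$ must also be a maximal common subsequence of $\alpha^i$ and $\beta^j$. For any supersequence of $\gamma$ that is a subsequence of $\alpha^i$ and $\beta^j$ must either be a subsequence of $\alpha^i$ and $\beta^{j-1}$, or of $\alpha^{i-1}$ and $\beta^j$.

On the other hand, suppose that $\gamma$ is a maximal common subsequence of length $r$ of $\alpha^i$ and $\beta^j$.

case (iia) $\mathrm{sp}(\alpha, \gamma) = i$. Then $\gamma$ is a maximal common subsequence of $\alpha^i$ and $\beta^{j-1}$, and so $(r, (i, y)) \in S_{i,j-1}$ for some $y$.

case (iib) $\mathrm{sp}(\beta, \gamma) = j$. Then $\gamma$ is a maximal common subsequence of $\alpha^{i-1}$ and $\beta^j$, and so $(r, (x, j)) \in S_{i-1,j}$ for some $x$.

case (iic) $\mathrm{sp}(\alpha, \gamma) < i$, $\mathrm{sp}(\beta, \gamma) < j$. Then $\gamma$ is both a maximal common subsequence of $\alpha^{i-1}$ and $\beta^j$, and of $\alpha^i$ and $\beta^{j-1}$. So $(r, (\mathrm{sp}(\alpha, \gamma), \mathrm{sp}(\beta, \gamma))) \in S_{i-1,j} \cap S_{i,j-1}$.

This completes the proof of the theorem. □

Recovering a Shortest Maximal Common Subsequence

The recovery of a particular smcs involves a standard type of traceback through the dynamic programming table from cell $(m, n)$, during which the sequence is constructed in reverse order.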
To facilitate this traceback, each entry in position $(i, j)$ in the table (for all $i, j$) should have associated with it, during the application of the dynamic programming scheme, one or more pointers indicating which particular element(s) in cells $(i-1, j)$, $(i, j-1)$ or $(i-1, j-1)$ led to the inclusion of that element in cell $(i, j)$. For example, if $\alpha[i] = \beta[j] = a$, and $(r-1, (x, y)) \in S_{i-1,j-1}$, then $(r, (\mathrm{next}_\alpha(x, a), \mathrm{next}_\beta(y, a)))$ is placed in cell $(i, j)$ with a pointer to the element $(r-1, (x, y))$ in cell $(i-1, j-1)$. With these pointers, any path from an element $(r, (x, y))$ in cell $(m, n)$ to the element in cell $(0, 0)$ represents a maximal common subsequence of $\alpha$ and $\beta$ of length $r$, namely the reversed sequence of matching symbols from the two strings corresponding to cells from which the path takes a diagonal step.

Analysis of the SMCS Algorithm

The number of cells in the dynamic programming table is essentially $mn$, so that if we could show that the number of entries in each cell was bounded by, say, $\min(m, n)$, and that the total amount of computation was bounded by a constant times the total number of table entries, then we would have a cubic time worst-case bound for the complexity of the algorithm. However, this turns out not to be the case, as the following example shows.

Example. Consider two strings of length $n = p(p+1)/2 + q$ over an alphabet $\Sigma = \{a_1, \ldots, a_n\}$, defined as follows:

$$\alpha = \sigma_1 + \sigma_2 + \cdots + \sigma_p + a_{p(p+1)/2+1} \ldots a_n$$
$$\beta = \sigma_p + \sigma_{p-1} + \cdots + \sigma_1 + a_n \ldots a_{p(p+1)/2+1}$$

where $\sigma_1 = a_1$, $\sigma_2 = a_2 a_3$, $\ldots$, $\sigma_p = a_{(p-1)p/2+1} \ldots a_{p(p+1)/2}$, and $+$ denotes concatenation. In the dynamic programming table for strings $\alpha$ and $\beta$, position $(n, n)$ contains the $pq$ entries $(r, (x, y))$ for $r = 2, \ldots, p+1$, $x = p(p+1)/2 + 1, \ldots, n$, $y = n + 1 + p(p+1)/2 - x$. Here, entry $(r, (x, y))$ arises from the maximal common subsequence $a_{t+1} \ldots a_{t+r-1} a_x$, where $t = (r-2)(r-1)/2$. With $q = \Theta(p^2)$, this gives $\Theta(n^{3/2})$ entries in the $(n, n)$th cell.
However, suppose that we wish to find only the length of an smcs (and to construct such a sequence by traceback through the table). Then, if any particular cell in the table contains more than one entry $(r, (x, y))$ with the same $(x, y)$ component, we may discard all but the one with the smallest $r$ value. For if a maximal common subsequence $\gamma$ has a prefix $\gamma'$ such that $\mathrm{sp}(\alpha, \gamma') = x$ and $\mathrm{sp}(\beta, \gamma') = y$, then to make $\gamma$ as short as possible, $\gamma'$ must be chosen as short as possible. Also, if the entries $(r, (x, y))$ in the $(i, j)$th cell are listed in increasing order of $x$, then they must clearly also be in decreasing order of $y$, and therefore, since $x \le i$, $y \le j$, the number of such entries with distinct $(x, y)$ components cannot exceed $\min(i, j)$.

Further, it is easy to see that by processing the lists of cell entries in this fixed order, the amount of work done in computing the contents of cell $(i, j)$ is, in case (i), bounded by a constant times the number of entries in cell $(i-1, j-1)$, and in case (ii), bounded by a constant times the sum of the numbers of entries in cells $(i-1, j)$ and $(i, j-1)$. (In case (i), this assumes precomputation of the tables of next values, which can easily be achieved in $O(n|\Sigma|)$ time for a string of length $n$, where $\Sigma$ is the alphabet.)

In conclusion, the length of an smcs can be established by a suitably amended version of the above dynamic programming scheme in $O(m^2 n)$ time in the worst case, for strings of lengths $m$ and $n$ ($m \le n$). Furthermore, such a subsequence can also be constructed from the dynamic programming table without increasing that overall time bound. But it remains open whether the lengths of all maximal common subsequences can be established within that time bound. A trivial bound of $O(m^3 n)$ applies in that case, since the number of entries in each cell is certainly bounded by $m^2$.
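The recurrence of Theorem 1 can be transcribed almost literally. The sketch below is an unoptimised illustration rather than the amended $O(m^2 n)$ version: it keeps every entry, stores $(r, (x, y))$ as a flat triple, and omits traceback pointers. Two points are assumptions of the sketch and not statements from the text: the function names, and the boundary cells $S_{i0} = S_{0j} = \{(0, (0, 0))\}$, chosen because the empty string is the only common subsequence when one prefix is empty.

```python
def next_table(s, alphabet):
    """nxt[i][a] = least 1-based k > i with s[k] == a, or len(s) + 1 if none.

    Built right to left in O(len(s) * |alphabet|) time.
    """
    m = len(s)
    nxt = [None] * (m + 1)
    last = {a: m + 1 for a in alphabet}
    for i in range(m, -1, -1):
        nxt[i] = dict(last)          # answers for positions strictly after i
        if i >= 1:
            last[s[i - 1]] = i       # position i becomes visible to smaller i
    return nxt


def maximal_common_subsequence_entries(alpha, beta):
    """Return S_mn as a set of triples (r, x, y), following Theorem 1 as stated."""
    m, n = len(alpha), len(beta)
    sigma = set(alpha) | set(beta)
    na, nb = next_table(alpha, sigma), next_table(beta, sigma)
    # Assumed boundary: row 0 and column 0 hold only the empty subsequence.
    S = [[{(0, 0, 0)} for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a = alpha[i - 1]
            if a == beta[j - 1]:
                # Case (i): extend every entry of the diagonal cell by a.
                S[i][j] = {(r + 1, na[x][a], nb[y][a])
                           for (r, x, y) in S[i - 1][j - 1]}
            else:
                # Case (ii): entries pinned to column j, entries pinned to
                # row i, and entries shared by both neighbouring cells.
                up, left = S[i - 1][j], S[i][j - 1]
                S[i][j] = ({e for e in up if e[2] == j}
                           | {e for e in left if e[1] == i}
                           | (up & left))
    return S[m][n]


def smcs_length(alpha, beta):
    """Length of a shortest maximal common subsequence (smallest r in S_mn)."""
    return min(r for (r, _, _) in maximal_common_subsequence_entries(alpha, beta))


print(smcs_length("abc", "bca"))  # 1
print(smcs_length("abc", "dab"))  # 2
```

Keeping, for each $(x, y)$, only the entry with the smallest $r$ would give the pruned variant discussed above; the sketch leaves the sets unpruned so that all lengths remain visible.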
3 The LMCS problem for two strings

The LMCS algorithm is not dissimilar in spirit to the SMCS algorithm, and there is a certain duality involving the terms in which the algorithm is expressed. Given strings $\alpha$ and $\gamma$, we define

$$\mathrm{lp}(\alpha, \gamma) = \text{length of the longest prefix of } \alpha \text{ that is a subsequence of } \gamma.$$

Given strings $\alpha, \beta$ of lengths $m$ and $n$ respectively, we define the set $T_{ij}$, for each $i = 0, \ldots, m$, $j = 0, \ldots, n$, by

$$T_{ij} = \{(r, (x, y)) : \text{there exists a minimal common supersequence } \gamma \text{ of } \alpha^i \text{ and } \beta^j, \text{ of length } r, \text{ such that } \mathrm{lp}(\alpha, \gamma) = x,\ \mathrm{lp}(\beta, \gamma) = y\}.$$

Finally, for a string $\alpha$, position $i$ and symbol $a$, we define

$$f_\alpha(i, a) = \begin{cases} i + 1 & \text{if } \alpha[i+1] = a \\ i & \text{otherwise.} \end{cases}$$

The algorithm for LMCS is based on a dynamic programming scheme for the sets $T_{ij}$ defined above. So evaluation of $T_{mn}$ reveals the length of an lmcs, but also finds the lengths of all minimal common supersequences of $\alpha$ and $\beta$ (indeed of all minimal common supersequences of all pairs of prefixes of $\alpha$ and $\beta$). Furthermore, suitable tracebacks in the array of $T_{ij}$ values can be used to generate not only an lmcs, but all minimal common supersequences.

The zeroth row and column of the $T_{ij}$ table can be evaluated trivially, as follows:

$$T_{i0} = \{(i, (i, \mathrm{lp}(\beta, \alpha^i)))\} \quad (1 \le i \le m)$$

and

$$T_{0j} = \{(j, (\mathrm{lp}(\alpha, \beta^j), j))\} \quad (1 \le j \le n)$$

with $T_{00} = \{(0, (0, 0))\}$.

The basis of the dynamic programming scheme is contained in the following theorem:

Theorem 2
(i) If $\alpha[i] = \beta[j] = a$ then
$$T_{ij} = \{(r, (f_\alpha(x, a), f_\beta(y, a))) : (r-1, (x, y)) \in T_{i-1,j-1}\}.$$
(ii) If $\alpha[i] = a \ne b = \beta[j]$ then
$$T_{ij} = \{(r, (f_\alpha(x, b), j)) : (r-1, (x, j-1)) \in T_{i,j-1}\} \cup \{(r, (i, f_\beta(y, a))) : (r-1, (i-1, y)) \in T_{i-1,j}\}.$$

Proof (i) Suppose $(r-1, (x, y)) \in T_{i-1,j-1}$ and that $\gamma'$ is a minimal common supersequence of $\alpha^{i-1}$ and $\beta^{j-1}$ of length $r-1$ with $\mathrm{lp}(\alpha, \gamma') = x$ and $\mathrm{lp}(\beta, \gamma') = y$. Then it is immediate that $\gamma = \gamma' + a$ is a minimal common supersequence, of length $r$, of $\alpha^i$ and $\beta^j$, and that $\mathrm{lp}(\alpha, \gamma) = f_\alpha(x, a)$ and $\mathrm{lp}(\beta, \gamma) = f_\beta(y, a)$.

On the other hand, suppose that $\gamma$ is a minimal common supersequence of length $r$ of $\alpha^i$ and $\beta^j$.
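Then $\gamma[r] = a$, and $\gamma' = \gamma - a$ is certainly a common supersequence of $\alpha^{i-1}$ and $\beta^{j-1}$. If $\gamma'$ were not minimal, then some proper subsequence $\delta$ of $\gamma'$ would be a common supersequence of $\alpha^{i-1}$ and $\beta^{j-1}$, and therefore $\delta + a$, a proper subsequence of $\gamma$, would be a common supersequence of $\alpha^i$ and $\beta^j$, contradicting the minimality of $\gamma$. So $(r-1, (x, y)) \in T_{i-1,j-1}$ with $x = \mathrm{lp}(\alpha, \gamma')$, $y = \mathrm{lp}(\beta, \gamma')$ and $\mathrm{lp}(\alpha, \gamma) = f_\alpha(x, a)$, $\mathrm{lp}(\beta, \gamma) = f_\beta(y, a)$.

(ii) Suppose $(r-1, (i-1, y)) \in T_{i-1,j}$, and that $\gamma'$ is a minimal common supersequence of $\alpha^{i-1}$ and $\beta^j$ of length $r-1$ with $\mathrm{lp}(\alpha, \gamma') = i-1$, $\mathrm{lp}(\beta, \gamma') = y$. (The argument is similar in the case $(r-1, (x, j-1)) \in T_{i,j-1}$.) Then $\gamma = \gamma' + a$ is a common supersequence of $\alpha^i$ and $\beta^j$ with $\mathrm{lp}(\alpha, \gamma) = i$ and $\mathrm{lp}(\beta, \gamma) = f_\beta(y, a)$. Further, $\gamma$ must be minimal. For suppose that a proper subsequence $\delta$ of $\gamma$ is a common supersequence of $\alpha^i$ and $\beta^j$. If $\delta$ were a subsequence of $\gamma'$, then $\gamma'$ would not be a minimal common supersequence of $\alpha^{i-1}$ and $\beta^j$. So $\delta = \delta' + a$, where $\delta'$ is a proper subsequence of $\gamma'$. So $\delta'$ cannot be a common supersequence of $\alpha^{i-1}$ and $\beta^j$. If it is not a supersequence of $\alpha^{i-1}$ then $\delta' + a$ cannot be a supersequence of $\alpha^i$, a contradiction. If it is not a supersequence of $\beta^j$ then, since $\delta' + a$ is a supersequence of $\beta^j$, we must have $\beta[j] = a$, a contradiction.

On the other hand, suppose that $\gamma$ is a minimal common supersequence of length $r$ of $\alpha^i$ and $\beta^j$. Then $\gamma[r] = a$ or $b$.

case (iia) $\gamma[r] = a$. It is immediate that $\mathrm{lp}(\alpha, \gamma) = i$, for otherwise $\gamma - a$ would be a common supersequence of $\alpha^i$ and $\beta^j$. So $\gamma' = \gamma - a$ is a minimal common supersequence of $\alpha^{i-1}$ and $\beta^j$ with $\mathrm{lp}(\alpha, \gamma') = i-1$ and $\mathrm{lp}(\beta, \gamma') = y$ for some $y$ such that $\mathrm{lp}(\beta, \gamma) = f_\beta(y, a)$.

case (iib) $\gamma[r] = b$. A similar argument shows that $\gamma' = \gamma - b$ is a minimal common supersequence of $\alpha^i$ and $\beta^{j-1}$ with $\mathrm{lp}(\beta, \gamma') = j-1$ and $\mathrm{lp}(\alpha, \gamma') = x$ for some $x$ such that $\mathrm{lp}(\alpha, \gamma) = f_\alpha(x, b)$.

This completes the proof of the theorem. □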
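Theorem 2, together with the boundary values for the zeroth row and column, can be transcribed in the same literal way. Again this is an unoptimised sketch under stated assumptions: entries $(r, (x, y))$ are stored as flat triples, nothing is pruned, no traceback pointers are kept, the function names are illustrative, and $f_\alpha(i, a)$ is taken to return $i$ when position $i+1$ lies beyond the end of the string.

```python
def lp(s, t):
    """Length of the longest prefix of s that is a subsequence of t (greedy scan)."""
    k = 0
    for ch in t:
        if k < len(s) and s[k] == ch:
            k += 1
    return k


def f(s, i, a):
    """f_s(i, a): i + 1 if the (i+1)-th symbol of s (1-based) is a, else i.

    Assumption: returns i when i + 1 is past the end of s.
    """
    return i + 1 if i < len(s) and s[i] == a else i


def minimal_common_supersequence_entries(alpha, beta):
    """Return T_mn as a set of triples (r, x, y), following Theorem 2 as stated."""
    m, n = len(alpha), len(beta)
    T = [[set() for _ in range(n + 1)] for _ in range(m + 1)]
    T[0][0] = {(0, 0, 0)}
    for i in range(1, m + 1):                      # zeroth column
        T[i][0] = {(i, i, lp(beta, alpha[:i]))}
    for j in range(1, n + 1):                      # zeroth row
        T[0][j] = {(j, lp(alpha, beta[:j]), j)}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = alpha[i - 1], beta[j - 1]
            if a == b:
                # Case (i): append the shared symbol to diagonal entries.
                T[i][j] = {(r + 1, f(alpha, x, a), f(beta, y, a))
                           for (r, x, y) in T[i - 1][j - 1]}
            else:
                # Case (ii): append b to entries of cell (i, j-1) with
                # y == j-1, or append a to entries of cell (i-1, j) with
                # x == i-1.
                from_left = {(r + 1, f(alpha, x, b), j)
                             for (r, x, y) in T[i][j - 1] if y == j - 1}
                from_up = {(r + 1, i, f(beta, y, a))
                           for (r, x, y) in T[i - 1][j] if x == i - 1}
                T[i][j] = from_left | from_up
    return T[m][n]


def lmcs_length(alpha, beta):
    """Length of a longest minimal common supersequence (largest r in T_mn)."""
    return max(r for (r, _, _) in minimal_common_supersequence_entries(alpha, beta))


print(lmcs_length("abc", "bca"))  # 6
print(lmcs_length("abc", "dab"))  # 6
```

On $\alpha = abc$, $\beta = bca$ the final cell holds entries of lengths 4, 5 and 6, matching the minimal common supersequences $abca$, $bcabc$ and $bacbac$ of the introductory example.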
Recovering a Longest Minimal Common Supersequence

As in the case of an smcs, the recovery of a particular lmcs involves a traceback through the dynamic programming table from cell $(m, n)$ to cell $(0, 0)$, during which the sequence is constructed in reverse order. To facilitate the traceback, each entry in position $(i, j)$ in the table (for all $i$, $j$) should have associated with it, during the application of the dynamic programming algorithm, one or more pointers indicating which particular element(s) in cells $(i-1, j)$, $(i, j-1)$ or $(i-1, j-1)$ led to the inclusion of that element in cell $(i, j)$. For example, if $\alpha[i] = a = \beta[j]$ and $(r-1, (x, y)) \in T_{i-1,j-1}$, then $(r, (f_\alpha(x, a), f_\beta(y, a)))$ is placed in cell $(i, j)$ with a pointer to the element $(r-1, (x, y))$ in cell $(i-1, j-1)$. With these pointers, any path from an element $(r, (x, y))$ in cell $(m, n)$ to the element in cell $(0, 0)$ represents a minimal common supersequence of $\alpha$ and $\beta$ of length $r$, namely the reversed sequence of symbols found by recording $\alpha[i]$ for a vertical or diagonal step from cell $(i, j)$ and $\beta[j]$ for a horizontal step from cell $(i, j)$.

Analysis of the LMCS Algorithm

As in the case of the SMCS algorithm, we can establish a cubic time bound for the restricted version of the LMCS algorithm that is designed to find the length of an lmcs, and to construct such a common supersequence from the dynamic programming table. The trick again is the observation that, for this purpose, whenever $(r, (x, y))$ elements in the same cell have the same $(x, y)$ component, only one need be retained, namely that with the largest $r$ value. For if a minimal common supersequence $\gamma$ has a prefix $\gamma'$ such that $\mathrm{lp}(\alpha, \gamma') = x$ and $\mathrm{lp}(\beta, \gamma') = y$, then to make $\gamma$ as long as possible, $\gamma'$ should be chosen as long as possible. By this means we can restrict the number of elements in the $(i, j)$th cell to at most $i + j$, recalling that each such entry $(r, (x, y))$ has either $x = i$ or $y = j$.
This leads to a worst-case time bound of $O(mn(m+n))$ for this version of the algorithm. Again, it is not clear whether the lengths of all minimal common supersequences can be found in time better than $O(mn(m+n)^2)$ in the worst case, this arising from the obvious upper bound of $(m+n)^2$ on the number of elements in each cell of the table.

4 The SMCS problem for k strings

It was first proved by Maier [Mai78] that the problem of finding an LCS of $k$ strings is NP-hard, even in the case of a binary alphabet. Further, Jiang and Li [JL92] showed that, unless P = NP, there cannot exist a polynomial-time approximation algorithm for LCS with a performance guarantee of $k^{\delta}$, for some $\delta > 0$.

As we shall see in Theorem 3, the transformation from Independent Set given by Maier [Mai78] to prove the NP-completeness of LCS also serves as a transformation from the Minimum Independent Dominating Set problem to SMCS. The former problem is also NP-hard [GJ79], and was shown by Irving [Irv91] not to have a polynomial-time approximation algorithm with a constant performance guarantee (if P $\ne$ NP). Halldórsson [Hal93] strengthened this result to show that, if P $\ne$ NP, then, for no $\epsilon < 1$, can there exist a polynomial-time approximation algorithm with performance guarantee $k^{\epsilon}$.

Maier's transformation has the property that the strings constructed have an LCS of length $r$ if and only if the original graph has an independent set of size $r$. If the given graph has $k$ edges then the derived LCS instance has $k + 1$ strings. It will therefore follow from the transformation, not only that SMCS is NP-hard, but also that this same strongly negative approximability result applies to the SMCS problem.

Theorem 3
(i) The SMCS problem is NP-hard.
(ii) If P $\ne$ NP, then, for no $\epsilon < 1$, can there exist a polynomial-time approximation algorithm for SMCS on $k$ strings with performance guarantee $k^{\epsilon}$.
Proof (i) Let $G = (V, E)$, $t$, with $V = \{v_1, \ldots, v_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$, be an arbitrary instance of (the decision version of) the Minimum Independent Dominating Set problem. We construct an instance of SMCS as follows. Include in the set $S$ of strings the string $\alpha_0 = v_1 v_2 \ldots v_n$. For each edge $e_i = \{v_p, v_q\}$ ($p < q$) include in the set $S$ of strings the string $\alpha_i$ defined by

$$\alpha_i = v_1 v_2 \ldots v_{p-1} v_{p+1} \ldots v_n v_1 v_2 \ldots v_{q-1} v_{q+1} \ldots v_n.$$

We claim that $G$ has an independent dominating set of size $t$ if and only if $S$ has a maximal common subsequence of length $t$. To prove this claim, we must show that
(a) if $G$ has an independent dominating set $U$ of size $t$ then $S$ has a maximal common subsequence of length $t$;
(b) if $S$ has a maximal common subsequence of length $t$ then $G$ has an independent dominating set of size $t$.

To prove (a), assume that $U = \{v_{u_1}, v_{u_2}, \ldots, v_{u_t}\}$ is an independent dominating set of size $t$, where $1 \le u_1 < u_2 < \ldots < u_t \le n$. The string $\gamma = v_{u_1} v_{u_2} \ldots v_{u_t}$ is clearly a subsequence of $\alpha_0$ and of any $\alpha_i$ not formed from an edge connecting two vertices of $U$. Since $U$ is an independent set, $\gamma$ is a common subsequence of $S$. If some proper supersequence $\gamma'$ of $\gamma$ is a common subsequence of $S$ then observe, for a contradiction, that there exists a $v_p$ in $\gamma'$ but not in $\gamma$, which is connected to some $v_q \in U$ in $G$ by an edge $e_j = \{v_p, v_q\}$, since $U$ is dominating. Assuming $p < q$, the string $\alpha_j = v_1 v_2 \ldots v_{p-1} v_{p+1} \ldots v_n v_1 v_2 \ldots v_{q-1} v_{q+1} \ldots v_n$. For $\gamma'$ to be a subsequence of $\alpha_0$, $v_p$ must precede $v_q$ in $\gamma'$. But this prevents $\gamma'$ from being a subsequence of $\alpha_j$. A similar contradiction is obtained if $p > q$ is assumed.

To prove (b), assume $\gamma = v_{u_1} v_{u_2} \ldots v_{u_t}$, of length $t$, is a maximal common subsequence of the strings in $S$. The first observation is that if $v_{u_p}$ and $v_{u_q}$ are two symbols in $\gamma$ and $p < q$ then $u_p < u_q$. For otherwise $\gamma$ could not be a subsequence of $\alpha_0$. The elements of $\gamma$ must form an independent set, $U$, of size $t$, in $G$.
To see this, observe for a contradiction that if two elements, $v_{u_p}$ and $v_{u_q}$ ($p < q$), of $\gamma$ are connected in $G$ by edge $e_j = \{v_{u_p}, v_{u_q}\}$ then the string $v_{u_p} v_{u_q}$, a subsequence of $\gamma$, would not be a subsequence of $\alpha_j$ (the string formed from $e_j$) and hence $\gamma$ would not be a subsequence of $\alpha_j$. If $U$ is not maximal then there exists $U'$, an independent set of size $t' > t$, with $U \subset U'$. This would imply there is some vertex $v_j$ that is a member of $U'$ but not a member of $U$. In that case, the string $\gamma' = v_{u_1} \ldots v_{u_p} v_j v_{u_{p+1}} \ldots v_{u_t}$, where $u_p < j < u_{p+1}$, a supersequence of $\gamma$, would be a common subsequence of all the strings in $S$, contradicting the maximality of $\gamma$. So $U$ is a maximal independent set, and a maximal independent set is necessarily dominating. This concludes the proof of part (i).

The proof of part (ii) follows from the observation that the reduction is linear [PY91], and therefore preserves the approximability of the Minimum Independent Dominating Set problem, and from the result of Halldórsson [Hal93] on the approximability of that problem. □

5 The LMCS problem for k strings

It is well known that the problem of finding a shortest common supersequence of $k$ strings is NP-hard [Mai78], even in many restricted cases: over a binary alphabet [RU81], even if all strings have the same length and all contain precisely two 1's [Mid92]; when all strings have length 3 and each character appears 2 times in total [Tim89]; and when all strings have length 2 and each character appears 3 times in total [Tim89]. We now show that the LMCS problem is also NP-hard in the general case. But in contrast to the SCS problem we can give a linear-time algorithm for LMCS if the strings are of length 2. The complexity of LMCS for strings of constant length $> 2$ is left open.

Theorem 4 The LMCS problem is NP-hard.

Proof Let positive integers $t$, $n = 3m$, and the set of positive integers $A = \{s_1, s_2, \ldots, s_n\}$, with $\frac{1}{4} t < s_i < \frac{1}{2} t$ for each $i$, constitute an arbitrary instance of the 3-Partition problem.
This problem asks whether there exists a partition $\{A_1, A_2, \ldots, A_m\}$ of $A$ into sets of size 3 such that $\sum_{s_j \in A_i} s_j = t$ holds for all $i \in [1:m]$. The 3-Partition problem is known to be NP-complete [GJ79]. Without loss of generality let $\sum_{i \in [1:n]} s_i = mt$ and $m > 3$.

We construct an instance of LMCS as follows. For each $i \in [1:m-1]$ include in the set $S$ of strings the string

$$\alpha_i = b^{it}\, d\, b^{(m-i)t}.$$

For each $s_i \in A$ include in the set $S$ the string $\beta_i$ defined by

$$\beta_i = u_i (b c_1 c_2 \ldots c_r d)^{s_i - 1} b c_1 c_2 \ldots c_r v_i$$

where $r = 2mt$. We claim that $A$ has a partition $\{A_1, A_2, \ldots, A_m\}$ into sets of size 3 such that $\sum_{s_j \in A_i} s_j = t$ holds for all $i \in [1:m]$ if and only if $S$ has a minimal common supersequence of length $t' = n + mt(r+2) + m - 1$.

Assume that $\{A_1, A_2, \ldots, A_m\}$ is a partition of $A$ into sets of size 3 such that $\sum_{s_j \in A_i} s_j = t$ holds for all $i \in [1:m]$. Without loss of generality assume $A_i = \{s_{3i-2}, s_{3i-1}, s_{3i}\}$ for $i \in [1:m]$. Then it is easy to verify that the string

$$\gamma = \beta_1 \beta_2 \beta_3\, d\, \beta_4 \beta_5 \beta_6\, d \ldots d\, \beta_{n-2} \beta_{n-1} \beta_n$$

is a minimal common supersequence of $S$ of length $t' = n + mt(r+2) + m - 1$.

On the other hand assume that $S$ has a minimal common supersequence $\gamma$ of length $t'$. We need the following claim.

Claim 1: Every minimal common supersequence $\delta$ of $\{\alpha_1, \alpha_2, \ldots, \alpha_{m-1}\}$ has length at most $2(m-1)t + m - 1$.

Proof of Claim 1: Clearly $\delta$ contains at most $m - 1$ $d$'s. Since $\delta$ is minimal, one of the following cases holds for each fixed occurrence of $b$ in $\delta$: (1) $\delta = \delta_1 d \delta_2$ where the fixed occurrence of $b$ is in $\delta_1$ and $\delta_1$ contains exactly $it$ $b$'s for an $i \in [1:m-1]$, or (2) $\delta = \delta_1 d \delta_2$ where the fixed occurrence of $b$ is in $\delta_2$ and $\delta_2$ contains exactly $it$ $b$'s for an $i \in [1:m-1]$. It follows that there can be at most $2(m-1)t$ $b$'s in $\delta$, which proves the claim.

Proof of Theorem 4 (continued): Since $\gamma$ is minimal it contains each character $u_i$ and $v_i$, $i \in [1:n]$, exactly once. Clearly $\gamma$ must have a subsequence $\delta$ which is a minimal common supersequence of $\{\beta_1, \beta_2, \ldots, \beta_n\}$. Since each $\beta_i$ contains $s_i - 1$ $d$'s, the number of $d$'s in $\delta$ is at most $\sum_{i=1}^{n} (s_i - 1) = mt - n$.
Let $\delta'$ be the subsequence of $\delta$ consisting of the characters $b, c_1, c_2, \ldots, c_r$. Due to the minimality of $\delta$ we have $\delta' = (b c_1 c_2 \ldots c_r)^p$ for a $p \in [2:mt]$. Assume $p \le mt - 1$; then $\delta$ has length at most $2n + (mt-1)(r+1) + mt - n$. Since $\delta$ is a minimal common supersequence of $\{\beta_1, \beta_2, \ldots, \beta_n\}$ and by Claim 1 each minimal common supersequence of $\{\alpha_1, \alpha_2, \ldots, \alpha_{m-1}\}$ has length at most $2(m-1)t + m - 1$, we have that $\gamma$ has length at most

$$2n + (mt-1)(r+1) + mt - n + 2(m-1)t + m - 1 = t' - r - 1 + 2mt - 2t < t'$$

since $r > 2mt - 2t - 1$. Thus we obtain $\delta' = (b c_1 c_2 \ldots c_r)^{mt}$ and $|\delta| \le 2n + mt(r+1) + mt - n = n + mt(r+2)$. The minimality of $\delta$ now implies that the $\beta_i$'s can only be embedded into pairwise disjoint substrings of $\delta$.

Let $\delta''$ be the subsequence of $\delta$ consisting of all $b$'s and $d$'s. Observe that $\delta''$ contains $mt$ $b$'s and that there is a $d$ between at least every second pair of neighboring $b$'s. Assume $\gamma$ has an additional $b$ not contained in the subsequence $\delta''$. Let $\hat{\delta}$ be the sequence obtained after the insertion of this additional $b$ into the corresponding position of $\delta''$. Note that $\hat{\delta}$ contains exactly $mt + 1$ $b$'s. If $\hat{\delta}$ contains a $d$ between at least every second pair of neighboring $b$'s then it is easily checked that $\hat{\delta}$ is a supersequence of each sequence $\alpha_i$, $i \in [1:m-1]$. Otherwise $\hat{\delta}$ is of the form $\hat{\delta} = \delta_1 \delta_2 \delta_3$ with $\delta_2 = dbbbd$ (or $\hat{\delta} = \delta_1 dbb$ or $\hat{\delta} = bbd \delta_3$) and where $\delta_1$ and $\delta_3$ contain a $d$ between at least every second pair of neighboring $b$'s. Consequently at most one of the strings $\alpha_1, \alpha_2, \ldots, \alpha_{m-1}$ is not a subsequence of $\hat{\delta}$. This is the case if and only if $\hat{\delta} = \delta_1 dbbbd \delta_3$ and $\delta_1$ contains $it - 1$ $b$'s and $\delta_3$ contains $(m-i)t - 1$ $b$'s for an $i \in [1:m-1]$. Then the string $\alpha_i$ cannot be embedded into $\hat{\delta}$. It follows that $\gamma$ must have either one additional $d$ between the $b$'s of $\delta_2$ or one additional $b$ to the left or right of $\delta_2$ so that $\alpha_i$ is a subsequence of $\gamma$. We conclude that if $\gamma$ contains $mt + 1$ $d$'s then it contains at most two characters more than $\delta$, either two $b$'s or a $b$ and a $d$, and thus has length at most $|\delta| + 2 \le n + mt(r+2) + 2 < t'$.
From the above discussion we conclude that $\gamma$ contains each character $b, c_1, c_2, \ldots, c_r$ exactly $mt$ times. But then $\gamma$ can have length $t'$ only if it contains $mt - n + m - 1$ $d$'s. This is possible only if the subsequence $\delta''$ of $\delta$, which contains $mt - n$ $d$'s, has no $d$ between the $(it)$th and $(it+1)$th $b$ for $i \in [1:m-1]$. Then $\gamma$ must have an additional $d$ between the $(it)$th and $(it+1)$th $b$ for $i \in [1:m-1]$ because otherwise the string $\alpha_i$ is not a subsequence of $\gamma$. Let $\beta'_i = (bd)^{s_i - 1} b$ for $i \in [1:n]$. Now we have $\delta'' = \beta'_{\pi(1)} \beta'_{\pi(2)} \ldots \beta'_{\pi(n)}$ for a permutation $\pi$ of $[1:n]$ such that $\beta'_{\pi(3i-2)} \beta'_{\pi(3i-1)} \beta'_{\pi(3i)}$ has $t$ $b$'s for $i \in [1:m]$. It follows that $\{A_1, A_2, \ldots, A_m\}$ with $A_i = \{s_{\pi(3i-2)}, s_{\pi(3i-1)}, s_{\pi(3i)}\}$ for $i \in [1:m]$ is the partition of $A$ which was sought. □

The LMCS problem for strings of length 2

We now describe a linear-time algorithm to determine an lmcs for strings of length 2. Let $S$ be a set of strings each of length 2. Let $G = (V, E)$ be the corresponding directed graph where $V$ is the alphabet and $(a, b) \in E$ if and only if $ab \in S$. For ease of description we first assume that each string in $S$ is of the form $ab$ with $a \ne b$, which means $G$ has no loops. We further assume that $G$ has no isolated nodes. The algorithm is as follows:

(1) Compute the strongly connected components of $G$. (Recall that a directed graph is strongly connected if for every two different nodes $u, v$ there exists a directed path from $u$ to $v$ as well as from $v$ to $u$.) Represent each strongly connected component by any of its nodes and let $V^* \subseteq V$ be the set of all representatives of the strongly connected components of $G$. Let $G^* = (V^*, E^*)$ be the directed graph with $(a, b) \in E^*$ if and only if there exist nodes $c$ and $d$ in the strongly connected components represented by $a$ and $b$ respectively such that $(c, d) \in E$. Clearly $G^*$ contains no directed cycle. Let $V_{sou} = \{v_1, v_2, \ldots, v_p\} \subseteq V^*$ and $V_{sin} = \{v_{p+1}, v_{p+2}, \ldots, v_q\} \subseteq V^*$ be the sets of sources and sinks respectively in $G^*$. (Note that $G^*$ may have isolated nodes, which we include in $V_{sin}$ but not in $V_{sou}$.)
Set $V' = V_{sou} \cup V_{sin}$.
(2) Set $W = V'$ and $\alpha = v_1 v_2 \cdots v_q$.
(3) WHILE $V \ne W$ DO
Compute a directed path or cycle $w_0, w_1, \ldots, w_r$ in $G$ with $w_0, w_r \in W$ and $w_1, w_2, \ldots, w_{r-1} \in V \setminus W$.
Set $\alpha = w_{r-1} w_{r-2} \cdots w_1 \, \alpha \, w_{r-1} w_{r-2} \cdots w_1$ and $W = W \cup \{w_1, w_2, \ldots, w_{r-1}\}$.
(4) Return $\alpha$.

Clearly steps (2) and (4) can be done in linear time. To see that step (1) can be done in linear time, recall that the strongly connected components of a directed graph can be found in linear time. Also observe that if $V \ne W$ in step (3), it is always possible to find in linear time the required directed path or cycle containing at least one node in $V \setminus W$. Clearly the loop in step (3) is executed at most $|V|$ times. It is not hard to find a linear-time implementation of step (3) but we omit the details here. It follows that the entire algorithm runs in linear time.

It remains to establish the correctness. The string $\alpha$ returned by the algorithm has the form $\beta\,\delta\,\beta'$, where $\delta = v_1 v_2 \cdots v_q$ contains each character in $V'$ and where $\beta$ and $\beta'$ are permutations of the characters in $V \setminus V'$. Hence $\alpha$ contains, as a subsequence, each string of the form $ab$ with $a \in V \setminus V'$ or $b \in V \setminus V'$. For each string $ab \in S$ with $a, b \in V'$ we have that $a \in V_{sou}$ and $b \in V_{sin}$. Since each string $ab$ with $a \in V_{sou}$ and $b \in V_{sin}$ is a subsequence of $\delta$, we conclude that $\alpha$ is a common supersequence of $S$.

Now we show that $\alpha$ is minimal. Since $G$ has no isolated nodes, each character in $V$ must be contained in $\alpha$. Since the characters in $V'$ are contained only once in $\alpha$, none of their occurrences can be omitted. All characters in $V \setminus V'$ are contained exactly two times in $\alpha$. Consider step (3) of the algorithm when the directed path or cycle $w_0, w_1, \ldots, w_r$ is identified and the new string $\alpha = w_{r-1} w_{r-2} \cdots w_1 \, \alpha \, w_{r-1} w_{r-2} \cdots w_1$ is formed. Due to the edge $(w_{r-1}, w_r) \in E$, and since $w_r$ is contained only in the old string $\alpha$, the left occurrence of $w_{r-1}$ cannot be omitted. For each $i \in [1 : r-2]$, due to the edge $(w_i, w_{i+1}) \in E$, the left occurrence of $w_i$ and the right occurrence of $w_{i+1}$ cannot be omitted.
Finally, due to the edge $(w_0, w_1) \in E$, the right occurrence of $w_1$ cannot be omitted. So we have shown that $\alpha$ is a minimal common supersequence of $S$.

Before we can establish that $\alpha$ is an lmcs of $S$ we need the following lemma.

Lemma 1 Let $S$ be a set of strings each of length 2 where each string in $S$ is of the form $ab$ with $a \ne b$. Then for every common supersequence $\alpha$ of $S$ which contains every character at least twice it is possible to omit the leftmost occurrence of one of the characters in $\alpha$ such that the sequence so obtained is a common supersequence of $S$. By symmetry, the same holds with "rightmost" instead of "leftmost".

Proof Assume for a contradiction that there exists a common supersequence $\alpha$ of $S$ which contains every character at least twice and such that, for each character, if its leftmost occurrence is omitted, the string so obtained is not a common supersequence. Without loss of generality assume further that each character occurs not more than twice in $\alpha$ (otherwise all occurrences between the leftmost and the rightmost occurrence can be omitted). Let $a$ be the leftmost character in $\alpha$. Then there must exist a character $b$ which has both occurrences between the $a$'s in $\alpha$, i.e.

$\alpha = a \ldots b \ldots b \ldots a \ldots$

This holds because otherwise the left occurrence of $a$ can be omitted. Let $b$ be such that (i) no character occurs twice between the occurrences of $b$ and (ii) no character has both its occurrences to the left of the leftmost occurrence of $b$. Now there must be a character $c$ which has one occurrence to the left of the two $b$'s and the other between the two $b$'s in $\alpha$, i.e.

$\alpha = a \ldots c \ldots b \ldots c \ldots b \ldots a \ldots$

This holds because otherwise the leftmost occurrence of $b$ can be omitted. Furthermore let $c$ be such that no other character has both its occurrences between the occurrences of $c$. Iterating this argument shows the existence of a character $d$ which has one occurrence between the occurrences of $c$ and the other one to the left of the leftmost occurrence of $c$, and so forth. But this is not possible since $\alpha$ is finite. $\Box$
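Steps (1)-(4) for the loop-free case can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: symbols are single characters, the function names (`lmcs_length2`, `scc_representatives`, `is_minimal_common_supersequence`) are hypothetical and not taken from the paper, and the path search in step (3) uses a plain breadth-first search rather than the linear-time implementation whose details the text omits. The brute-force minimality check is included only so that the output can be inspected on small inputs.

```python
from collections import deque


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)


def is_minimal_common_supersequence(alpha, strings):
    """True if alpha is a common supersequence and no single character can go.

    Single deletions suffice: if some proper subsequence were a common
    supersequence, so would be some string obtained by one deletion.
    """
    if not all(is_subsequence(s, alpha) for s in strings):
        return False
    return not any(
        all(is_subsequence(s, alpha[:i] + alpha[i + 1:]) for s in strings)
        for i in range(len(alpha))
    )


def scc_representatives(nodes, succ):
    """Kosaraju's algorithm (recursive, fine for small alphabets)."""
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in succ[u]:
            if v not in seen:
                visit(v)
        order.append(u)

    for u in nodes:
        if u not in seen:
            visit(u)
    pred = {u: [] for u in nodes}
    for u in nodes:
        for v in succ[u]:
            pred[v].append(u)
    rep = {}

    def assign(u, root):
        rep[u] = root
        for v in pred[u]:
            if v not in rep:
                assign(v, root)

    for u in reversed(order):
        if u not in rep:
            assign(u, u)  # the root of each tree represents its component
    return rep


def lmcs_length2(strings):
    """Sketch of steps (1)-(4); assumes every string is ab with a != b."""
    edges = {(s[0], s[1]) for s in strings}
    assert all(a != b for a, b in edges), "loop-free case only"
    nodes = sorted({c for e in edges for c in e})
    succ = {u: sorted(b for (a, b) in edges if a == u) for u in nodes}
    # Step (1): components, then sources and sinks of the condensation.
    rep = scc_representatives(nodes, succ)
    has_in = {rep[b] for a, b in edges if rep[a] != rep[b]}
    has_out = {rep[a] for a, b in edges if rep[a] != rep[b]}
    reps = sorted(set(rep.values()))
    sources = [c for c in reps if c not in has_in and c in has_out]
    sinks = [c for c in reps if c not in has_out]  # isolated ones count as sinks
    # Step (2): W = V', alpha = v_1 ... v_q.
    done = set(sources) | set(sinks)
    alpha = "".join(sources + sinks)
    # Step (3): repeatedly thread a path w_0, ..., w_r through unused nodes.
    while len(done) < len(nodes):
        start = next(v for u in nodes if u in done
                     for v in succ[u] if v not in done)  # this is w_1
        parent, queue, last = {start: None}, deque([start]), None
        while last is None:
            u = queue.popleft()
            if any(v in done for v in succ[u]):
                last = u  # this is w_{r-1}: it has an edge back into W
            else:
                for v in succ[u]:
                    if v not in parent:
                        parent[v] = u
                        queue.append(v)
        inner = []
        while last is not None:
            inner.append(last)  # collects w_{r-1}, ..., w_1
            last = parent[last]
        block = "".join(inner)
        alpha = block + alpha + block
        done.update(inner)
    return alpha  # Step (4)


example = ["ab", "bc", "ca", "cd", "ea"]
result = lmcs_length2(example)
print(result, is_minimal_common_supersequence(result, example))
```

In the example, the component $\{a, b, c\}$ is neither a source nor a sink, so its three characters are added in step (3) and each appears twice, while the source $e$ and the sink $d$ appear once.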
Now we establish that $\alpha$ is an lmcs for $S$. For a contradiction assume that there exists a minimal common supersequence $\beta$ of $S$ which is longer than $\alpha$. Consequently there must exist a source or sink component such that all characters included in this component occur twice in $\beta$. Without loss of generality let this component be a sink. Let $U \subseteq V$ be the set of characters contained in this component and let $T \subseteq S$ be the set of strings in $S$ having both characters in $U$. Let $\beta_U$ be the subsequence of $\beta$ containing exactly the characters in $U$. Lemma 1 implies that we can omit one of the leftmost occurrences of a character in $\beta_U$ and still have a common supersequence of $T$. Let $\beta'$ be the string that is obtained if we omit the corresponding occurrence of a character in $\beta$. Since the characters in $U$ are contained in a sink, all strings in $S \setminus T$ which contain a character in $U$ are of the form $ab$ with $a \in V \setminus U$ and $b \in U$. Thus $\beta'$ must be a common supersequence of $S$, which contradicts the minimality of $\beta$.

Recall that it was assumed that $S$ contains no strings of the form $aa$ and thus $G$ has no loops. If this restriction is omitted the algorithm is changed as follows. If in step (1) a strongly connected component contains characters with a loop then the component is represented by one of these characters. In step (2) we add a second occurrence for those characters in $V'$ with a loop. Simple modifications of the proof given above show the correctness of this algorithm. Thus the following theorem is proved.

Theorem 5 The LMCS problem for strings of length 2 is solvable in linear time.

6 Conclusion and open problems

We have shown that, in the case of two strings (or indeed any fixed number of strings), a shortest maximal common subsequence and a longest minimal common supersequence can be found in polynomial time by dynamic programming. However, for general $k$, we have shown that finding a shortest maximal common subsequence or a longest minimal common supersequence of $k$ strings is NP-hard.
Further, unless P = NP, the length of a shortest maximal common subsequence cannot be approximated, in polynomial time, within a factor of $k^{\delta}$ for any $\delta < 1$.

It is natural to conjecture that finding a good approximation to the length of a longest minimal common supersequence of $k$ strings is just as hard, but we have no result of this kind.

Finally, the problem of finding a longest minimal common supersequence in the case of strings of length 2 is shown to be solvable in polynomial time, in contrast to finding the shortest common supersequence, which is NP-hard in this case. The LMCS problem in the case of strings of fixed length > 2 remains open.

References

[AG87] A. Apostolico and C. Guerra. The longest common subsequence problem revisited. Algorithmica, 2:315-336, 1987.
[GJ79] M.R. Garey and D.S. Johnson. Computers and Intractability. Freeman, San Francisco, CA., 1979.
[Hal93] M.M. Halldórsson. Approximating the minimum maximal independence number. Information Processing Letters, 46:169-172, 1993.
[Hir75] D.S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the A.C.M., 18:341-343, 1975.
[Hir77] D.S. Hirschberg. Algorithms for the longest common subsequence problem. Journal of the A.C.M., 24:664-675, 1977.
[HS77] J.W. Hunt and T.G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the A.C.M., 20:350-353, 1977.
[Irv91] R.W. Irving. On approximating the minimum independent dominating set. Information Processing Letters, 37:197-200, 1991.
[JL92] T. Jiang and M. Li. On the approximation of shortest common supersequences and longest common subsequences. Submitted for publication, 1992.
[Mai78] D. Maier. The complexity of some problems on subsequences and supersequences. Journal of the A.C.M., 25:322-336, 1978.
[Mid92] M. Middendorf. Zur Komplexität von Einbettungsproblemen für Wortmengen. PhD thesis, Fachbereich Mathematik, Universität Hannover, 1992.
[PY91] C.H. Papadimitriou and M. Yannakakis. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences, 43:425-440, 1991.
[RU81] K.-J. Räihä and E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science, 16:187-198, 1981.
[Tim89] V.G. Timkovskii. Complexity of common subsequence and supersequence problems and related problems. English Translation from Kibernetika, 5:1-13, 1989.
[Ukk85] E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64:100-118, 1985.
[WMMM90] S. Wu, U. Manber, G. Myers, and W. Miller. An O(NP) sequence comparison algorithm. Information Processing Letters, 35:317-323, 1990.
[YG80] M. Yannakakis and F. Gavril. Edge dominating sets in graphs. SIAM J. Appl. Math., 38:364-372, 1980.